This tutorial is a primer on how to go modular and use Docker 🐳 containers for your bioinformatics analysis tasks. More specifically, we will learn how to use the Deploit platform to assemble and deploy on the cloud a reproducible and sharable bioinformatics workflow.
We will assemble the following resources into a workflow on the Deploit platform:
- a fastq.gz file, fetched from an EMBL-EBI FTP site

After completing this mini workflow on the Deploit platform, you will have:
1) a plot-full FastQC HTML report, with key metrics to assess the quality of your FASTQ file
2) sharable links to your Job Pages, with interactive plots and information about the resources and the results.
You can access the Job Pages from the sharable URLs we created for this example:
- wgetGunzipper
- fastQsee

Take a look at what a Job Page looks like below:
For every job run on Deploit, a Job Page report is created.
In principle, you could easily install the dependencies for running FastQC on your own machine. But this tutorial is more about learning how to use Deploit to combine resources (code from GitHub, containers from Docker Hub) to assemble multi-step bioinformatics workflows. This fastQsee pipeline will serve as our dummy example to go through the steps.

Deploit enables you to structure your workflows as an assembly of individual, self-contained units of computation (jobs), by bringing all the required resources to run an analysis (data, code, OS & tools, computational resources) in one place. Each step in a bioinformatics workflow will most likely utilize different tools and have different dependencies. But why not install all the tools on one machine and run everything there, right? Well, for starters: dependencies!
We all know that it’s a hassle to make all the tools play nice together. There are several reasons why bioinformaticians have started joining developers and data scientists in slowly abandoning the monolithic, all-in-one-place analysis environment and shifting towards more robust, modular and portable solutions. While virtualization and containers have been around for quite some time, Docker has really revolutionized the way we work over the past few years. In a bioinformatics workflow, each process can utilize a different container as an execution microenvironment, with a main focus on preventing dependency conflicts and ensuring reproducibility.
This modularity also unlocks many cool features as a positive side effect:
1) Portability: Installation-free (and hassle-free!) runs on any other machine
2) Cloudability: Easily deploy on the cloud
3) Reproducibility: Allows someone else to run the same pipeline
4) Frictionless Retouching: Allows easily removing, adding or retouching individual processes without affecting the rest
5) Isolation of Dependencies: Conflicting dependencies are isolated
6) Same tool, different tool version: Ability to use the same tool, but a different version of it, in different processes if needed (legacy code in bioinformatics tools, anyone?)
The analysis journey of NGS-generated FASTQ files should always start, as with any other data analysis task, with a robust Exploratory Data Analysis (EDA) pass.
The FastQC tool (Andrews S. et al, 2010) facilitates this task by providing a plot-full HTML report with key metrics for read quality.

FastQC quality assessment plots
```r
list_of_img = c("https://github.com/cgpu/fastQsee_helper_repo/blob/master/images/per_base_sequence_quality.png?raw=true",
                "https://github.com/cgpu/fastQsee_helper_repo/blob/master/images/per_sequence_content.png?raw=true",
                "https://github.com/cgpu/fastQsee_helper_repo/blob/master/images/per_sequence_quality_scores.png?raw=true",
                "https://github.com/cgpu/fastQsee_helper_repo/blob/master/images/per_tile_sequence_quality.png?raw=true")

slickR::slickR(obj = list_of_img,
               padding = "10",
               width = "70%",
               height = 300)
```
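While FastQC gives you the full plotted report, a quick command-line pass can already surface basic numbers before you deploy anything. Below is a minimal sketch: the file and its two reads are synthetic, made up purely for illustration, and in practice you would point the same commands at your own FASTQ file. It counts reads and tabulates read lengths with awk:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative only: create a tiny synthetic FASTQ file (2 reads).
# In the tutorial you would point this at SRR062634.fastq instead.
cat > example.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGTAC
+
IIIIIIIIII
EOF

# A FASTQ record spans 4 lines: header, sequence, '+', qualities.
n_reads=$(( $(wc -l < example.fastq) / 4 ))
echo "reads: ${n_reads}"

# Read length distribution: the sequence is the 2nd line of each record.
awk 'NR % 4 == 2 { print length($0) }' example.fastq | sort -n | uniq -c
```

This is no substitute for the FastQC plots, but it is a cheap first look at whether a file contains what you expect.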
If you are not familiar with how to interpret the plots, you can start by taking a look at the good and bad quality sequencing data examples available on the FastQC tool’s official webpage. There is also a really great presentation on “RNA-seq quality control and pre-processing” by Mikael Huss that you can check out here:
In this tutorial, we will bring all the required resources to generate a FastQC report onto the Deploit platform. For the most part, the resources we will need can be provided in the form of links that point to the following four fundamental ingredients of any data analysis pipeline:
The main ingredient. For this example we will use a fastq.gz file from the 1000genomes project. We only need the URL that points to the file, hosted on an EMBL-EBI FTP server. You can find the one we selected in the following link:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634.filt.fastq.gz
(link to parent repo)
The instructions, the recipe for the transformations that will be performed on our raw data. The FastQC tool takes as input uncompressed FASTQ format data. So we need code for two main tasks to obtain the FastQC report.
We will use the following >_ bash command to do exactly that:

- wget: fetches the file from the FTP server and downloads it
- gunzip: uncompresses the .gz file
- >: renames the output

```shell
wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq
```
For a more detailed breakdown of the command, powered by Ubuntu’s manpage repository, you can check ExplainShell.com. Feel free to swap the shortened https://lifebit.page.link/ftp_SRR062634_fastq_gz with the original FTP link provided above. The output of this command will be the uncompressed FASTQ file named SRR062634.fastq.
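The fetch-and-decompress pattern is easy to try offline, too. In the sketch below, the real wget call from the tutorial is shown commented out (it needs network access), and a locally created .gz file with made-up contents stands in for the download, so the `gunzip -c >` decompress-and-rename step can be seen end to end:

```shell
#!/usr/bin/env bash
set -euo pipefail

# The tutorial's actual command (requires network access):
# wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq

# Offline stand-in: create a tiny .gz archive locally (contents are made up)
printf '@read1\nACGT\n+\nIIII\n' > toy.fastq
gzip -c toy.fastq > toy.fastq.gz

# '-c' streams the decompressed bytes to stdout; '>' writes them under a new name
gunzip -c toy.fastq.gz > renamed.fastq

# The round trip should be lossless
cmp -s toy.fastq renamed.fastq && echo "round-trip OK"
```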
For this we will use one of the pipelines already available in the PIPELINES > PUBLIC PIPELINES & TOOLS section on the Deploit platform. The only input required for the report is the uncompressed FASTQ file we acquired from the previous step.
The environment and tools that will facilitate the transformation of our data.
We will obviously use Docker, the FastQC tool and its dependencies (e.g. a suitable Java Runtime Environment). However, no installation is required on your machine, because everything can be made available on the Deploit platform, installation-free. Docker containers will serve as the software microenvironments that will host each task of the workflow.
For the first task mentioned above, retrieving and uncompressing the fastq.gz file, we will port a Docker container with a lightweight Linux distribution (we only need >_bash for wget and gunzip) from Docker Hub, by providing the link to the respective repository. You can find the Docker container we have chosen (with Alpine Linux) in the following link:

https://hub.docker.com/r/frolvlad/alpine-bash
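For context, images like this are essentially just Alpine Linux with bash added on top. A comparable image could be described with a Dockerfile along these lines (a hypothetical sketch, not the actual Dockerfile behind that repository):

```dockerfile
# Hypothetical sketch, not the real Dockerfile of the image used here:
# Alpine plus bash is all the wget | gunzip step needs
FROM alpine:3.9
RUN apk add --no-cache bash wget   # gzip/gunzip already ship with busybox
CMD ["/bin/bash"]
```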
We will use this Docker container for our combo wget | gunzip > command and deploy the job over the cloud from the Deploit platform. The uncompressed file will appear in the DATA > JOB RESULTS section of the platform, ready to be used as input in other pipelines.
All we need for this is to select the fastqc pipeline from the library of curated pipelines available on the Deploit platform. As input, we will use the output file from the previous step, the uncompressed FASTQ file. The FASTQ file can be accessed in the DATA > JOB RESULTS section in the platform.
Power! To spin all these up and generate our results.
Deploit brings all four required resources together in one place and orchestrates the deployment of your jobs over the cloud. If you don’t have a cloud account yet, you can still try the platform: upon registration, we provide you with a Lifebit cloud account with preloaded credits, so that you can run your first analyses. If you want to have access to your own resources (data, credits), you can link your own cloud account.
Now that we have an overview of how Deploit brings your resources together in one place, and we have identified the resources we will need, it’s time to go back to the fastQsee tutorial to generate the FastQC report.
All the resources that we will need can be summarized in the following table:
| What | Where |
|---|---|
| DATA | SRR062634.fastq.gz (1000genomes example file) |
| CODE | `wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz \| gunzip -c > SRR062634.fastq` |
| OS/TOOLS | frolvlad/alpine-bash & lifebitai/fastqc Docker containers |
| RESOURCES | Lifebit Cloud (provided with registration) |
Let’s head over to the Deploit platform to generate the FastQC report step-by-step.
Project for your analysis tasks

For generating the FastQC report, we will deploy two jobs:
1) One for retrieving the file
2) One for running the FastQC tool.
It is advisable to create a Project to host the individual tasks/jobs of a workflow. Think of the Project entity in the Deploit platform as the parent directory of your project. There, you will have access not only to the data and code, but also to all the jobs that have been run, and you will be able to revisit, clone and redeploy the same jobs very easily.
Log in to your Deploit account and access the Projects section from the light bulb icon 💡 on the left of your screen. Click on the green New button on the right, and provide a Name and Description for your new Project.
We set up the Project for this example by filling in:
- Name: “fastQsee”
- Description: “Quickly generate a FastQC report”

You can find an overview of how these steps look on Deploit below:
After you log in to Deploit, find the Pipelines section from the navigation bar on the left of your screen. We will create a new pipeline in the Pipelines > My Pipelines & Tools section.
This is how we will port the frolvlad/alpine-bash Docker container to Deploit, so that we can use it for retrieving the fastq.gz file. Have the link to the Docker repository ready for copy+pasting:

URL to Docker Hub: hub.docker.com/r/frolvlad/alpine-bash

Then go ahead and click the green New button on the right.
As shown above, you will be prompted to select “Where are you porting your pipeline from?”. Click on the Docker whale and then click on Select to proceed. Continue by filling in the required fields to port the container. You can have a look at how we set this up below as an example:
- Docker Hub URL: https://hub.docker.com/r/frolvlad/alpine-bash
- Name: “wgetGunzipper”
- Default command: (leave this blank)
- Description: “A lightweight Linux distro for wget/gunzip”

The Default command is handy for saving time, but for now we will leave this blank so that we can use the command field as a terminal.
You can see an overview of the process described above in the following gif:
fastq.gz file on Deploit

Time to use our newly created pipeline and utilize the Docker container that we ported. We will use the command field on Deploit as a terminal, and the Docker container as our operating system, to download and decompress the fastq.gz file we selected for this example. We will then deploy the job and come back to find our FASTQ file in the Data > Job Results section on the Deploit platform.
We will run the following command as mentioned earlier:
```shell
wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq
```
Have it ready for copy+pasting, and let’s go back to the Deploit platform. Access the newly created pipeline that utilizes the Alpine Linux docker container (we named ours wgetGunzipper) by clicking:
Pipelines > My Pipelines & Tools > wgetGunzipper
Paste the command from above in the Executable field. Take a look at how this will look on the Deploit platform below:
You can see an overview of the final command in the bottom of the screen:
Now we are ready to deploy the first of the two jobs for generating the FastQC report.
Click on Next on the top right of your screen. You will be redirected to the page where you will:
- the Project in which your job belongs (we selected fastQsee)

When you have selected both, you are ready to submit your job. Go ahead and click Run job. You will be redirected to the Jobs page. Your job will be scheduled, initialized and completed shortly. Take a look below to check how these steps should look in the Deploit platform.
After job completion, you can access the decompressed FASTQ file in the Data > Job Results section. We expect to find our FASTQ file in the fastQsee Project folder, with the filename we defined when retrieving it: SRR062634.fastq.

Reminder: this is the command we submitted to download, decompress and rename the fastq.gz file:

```shell
wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq
```
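Before feeding the result into the next pipeline, it can be worth sanity-checking the file. The sketch below uses a synthetic stand-in archive (made up for illustration, since the real download needs network access) to show two quick checks: `gzip -t` verifies the archive is not truncated, and a look at the first line confirms the decompressed file starts with a FASTQ header:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Synthetic stand-in for the downloaded fastq.gz (contents are made up)
printf '@toy.1\nACGTACGT\n+\nIIIIIIII\n' | gzip -c > archive.fastq.gz

# 1) Test archive integrity without writing anything to disk
gzip -t archive.fastq.gz && echo "archive OK"

# 2) Decompress and confirm the first record header starts with '@'
gunzip -c archive.fastq.gz > sample.fastq
head -n 1 sample.fastq | grep -q '^@' && echo "looks like FASTQ"
```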
lifebitai/fastqc

Our input file for the next and last step, the uncompressed FASTQ file, is now available on the Deploit platform in the Data > Job Results section. We will use this file as input for the lifebitai/fastqc pipeline, which is essentially a dockerized version of the FastQC tool and its dependencies. In the Pipelines > PUBLIC PIPELINES & TOOLS section, start typing “fastqc” in the search bar to easily find the lifebitai/fastqc pipeline.
Time to set up the pipeline. Let’s:
- the Project that the pipeline will be associated with

As shown above, we have selected as input data the 1000genomes FASTQ file we fetched from the EMBL-EBI FTP server in the previous step. No other parameters are required to run the fastqc pipeline. We then selected an instance (anything larger than 1 CPU will work).
All curated pipelines included in Deploit’s library, in the Pipelines > PUBLIC PIPELINES & TOOLS section, come with example parameters and data. Click Try with example data & parameters to see the example values, either from the Deploit fields or from the final command field:
This way you can see how the command should look when you customize the pipeline with your own parameters and input data. For example, as shown above, we can see that the lifebitai/fastqc pipeline can be run just by typing:

fastqc name_of_my_fastq_file.fq

Notice the .fq file format required, and that no other parameters are needed.
You can run with example parameters & data just to check the output files that a pipeline generates, and to explore new pipelines and tools you haven’t used before on your own omics data. You might discover another way to interrogate your omics data and generate more results to inspect.
Once the job has been completed, you can access your results from the Job Page, as shown below:
The Job Page serves as a summary report that includes information about:
- Job status: job progress (% complete), or whether it has failed
- Resource Monitor: CPU usage, RAM requirements
- Table overview: runtime, number of processes, total cost

It also works as a portal to access all the generated output files in the Results section. Every job is assigned a unique Job ID, so that you can reference back to it, or retrieve information about the job programmatically with your private key through Deploit’s RESTful API.
Many thanks to Phil Palmer and Diogo Silva for reproducibility and Docker feedback.
sessionInfo

sessioninfo::package_info()

## package * version date lib source
## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 3.5.2)
## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
## digest 0.6.18 2018-10-10 [1] RSPM (R 3.5.2)
## emo 0.0.0.9000 2019-01-29 [1] Github (hadley/emo@02a5206)
## evaluate 0.12 2018-10-09 [1] RSPM (R 3.5.2)
## glue 1.3.0.9000 2019-01-29 [1] Github (tidyverse/glue@8188cea)
## htmltools 0.3.6 2017-04-28 [1] RSPM (R 3.5.2)
## htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.2)
## jsonlite 1.6 2018-12-07 [1] RSPM (R 3.5.2)
## knitr 1.21 2018-12-10 [1] RSPM (R 3.5.2)
## lubridate 1.7.4 2018-04-11 [1] RSPM (R 3.5.2)
## magrittr 1.5 2014-11-22 [1] RSPM (R 3.5.2)
## purrr 0.3.0 2019-01-27 [1] CRAN (R 3.5.2)
## Rcpp 1.0.0 2018-11-07 [1] RSPM (R 3.5.2)
## rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
## rmarkdown 1.11 2018-12-08 [1] RSPM (R 3.5.2)
## rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
## slickR 0.2.4 2018-03-06 [1] CRAN (R 3.5.2)
## stringi 1.2.4 2018-07-20 [1] RSPM (R 3.5.2)
## stringr 1.3.1 2018-05-10 [1] RSPM (R 3.5.2)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
## xfun 0.4 2018-10-23 [1] RSPM (R 3.5.2)
## xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.2)
## yaml 2.2.0 2018-07-25 [1] RSPM (R 3.5.2)
##
## [1] /home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.5
## [2] /opt/R/3.5.2/lib/R/library
sessioninfo::platform_info()
## setting value
## version R version 3.5.2 (2018-12-20)
## os Ubuntu 16.04.5 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C.UTF-8
## ctype C.UTF-8
## tz Etc/UTC
## date 2019-02-03